Abstract

Deep Boltzmann machines are in principle powerful models for extracting the hierarchical 
structure of data. Unfortunately, attempts to train layers jointly (without greedy layer- 
wise pretraining) have been largely unsuccessful. We propose a modification of the learning 
algorithm that initially recenters the output of the activation functions to zero. This 
modification leads to a better conditioned Hessian and thus makes learning easier. We 
test the algorithm on real data and demonstrate that our suggestion, the centered deep
Boltzmann machine, learns a hierarchy of increasingly abstract representations and a better
generative model of data.
1. Introduction

Deep Boltzmann machines (DBM, Salakhutdinov and Hinton, 2009) are in principle powerful
models for extracting the hierarchical structure of data (Montavon et al., 2012). Unfortunately,
attempts to train layers jointly (without greedy layer-wise pretraining) have been
mostly unsuccessful. As we will argue later in greater detail, a possible reason for this could
be that the mapping of net activities onto the sigmoid nonlinearities is not centered to zero
by default.

In this paper, we propose to recenter the output of each unit to zero by rewriting the
energy as a function of centered states $\xi = x - \beta$ where $\beta$ is an offset parameter. The
reparameterization of the energy function leads to a better conditioned Hessian of the 
estimated model log-likelihood. The centered Boltzmann machine is easy to implement as 
the reparameterization leaves the associated Gibbs distribution invariant. 

We train a centered deep Boltzmann machine on the MNIST data set. Empirical results 
show that the centered DBM is able to learn a top-layer representation that contains useful 
discriminative features and to produce a good generative model of data. In addition, the 
centered DBM learns faster and is more stable than its non-centered counterpart. 

Related work  The case for using centered nonlinearities has already been made by
LeCun et al. (1998) and Glorot and Bengio (2010) in the context of backpropagation networks,
showing that the logistic function generally performs poorly compared to its centered
counterpart, the hyperbolic tangent. The idea of centering was also proposed by
Tang and Sutskever (2011) in the context of restricted Boltzmann machines but was
restricted to data centering.

2. Centered Boltzmann Machines 

In this section, we introduce the centered Boltzmann machine. In the following, the sigmoid
function is defined as $\mathrm{sigm}(x) = \frac{e^x}{e^x + 1}$, $x \sim \mathcal{B}(p)$ denotes that the variable $x$ is drawn
randomly from a Bernoulli distribution of parameter $p$ and $\langle \cdot \rangle_P$ denotes the expectation
operator with respect to a probability distribution $P$. All these operations apply element-wise
to the input vector.

A Boltzmann machine is a network of $M_x$ interconnected binary units that associates
to each state $x \in \{0, 1\}^{M_x}$ the energy

$$E(x; \theta) = -\tfrac{1}{2} x^\top W x - x^\top b$$

where $\theta = \{W, b\}$ groups the model parameters. The matrix $W$ of size $M_x \times M_x$ is symmetric
and contains the connection strength between units. The vector $b$ of size $M_x$ contains the
biases associated to each unit. A probability is associated to each state according to the
Gibbs distribution

$$p(x; \theta) = \frac{e^{-E(x; \theta)}}{\sum_{x'} e^{-E(x'; \theta)}}$$


where the term in the denominator is the partition function that makes probabilities sum to 
one. For the centered Boltzmann machine, we rewrite the energy as a function of centered 
states 

$$E(x; \theta) = -\tfrac{1}{2} (x - \beta)^\top W (x - \beta) - (x - \beta)^\top b$$

where $\theta = \{W, b, \beta\}$ and where the vector $\beta$ contains the offsets associated to each unit of
the network. Setting $\beta = \mathrm{sigm}(b_0)$ where $b_0$ is the initial bias enforces the initial centering
of the Boltzmann machine. From these equations, we can derive the conditional probability

$$p(x_i = 1 \mid x_{-i}; \theta) = \mathrm{sigm}\Big(b_i + \sum_j W_{ij} (x - \beta)_j\Big)$$



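of each unit and the gradient of the model log-likelihood with respect to $W$ and $b$:

$$\frac{\partial}{\partial W} \langle \log p(x; \theta) \rangle_{\text{data}} = \langle (x - \beta)(x - \beta)^\top \rangle_{\text{data}} - \langle (x - \beta)(x - \beta)^\top \rangle_{\text{model}}$$

$$\frac{\partial}{\partial b} \langle \log p(x; \theta) \rangle_{\text{data}} = \langle x - \beta \rangle_{\text{data}} - \langle x - \beta \rangle_{\text{model}}$$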
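
The invariance of the Gibbs distribution under the reparameterization can be checked numerically on a small machine. The sketch below assumes the centered energy $E = -\frac{1}{2}\xi^\top W \xi - \xi^\top b$ with a symmetric $W$ that has a zero diagonal, and compares it with a non-centered machine whose bias is shifted to $b - W\beta$. The function names and the 4-unit size are illustrative choices, not taken from the paper.

```python
import itertools
import numpy as np

def sigm(x):
    return 1.0 / (1.0 + np.exp(-x))

def energy(x, W, b, beta):
    """Centered energy; beta = 0 gives the standard Boltzmann machine."""
    xi = x - beta
    return -0.5 * xi @ W @ xi - xi @ b

def gibbs_distribution(W, b, beta):
    """Exact Gibbs distribution obtained by enumerating all 2^M binary states."""
    M = len(b)
    states = np.array(list(itertools.product([0.0, 1.0], repeat=M)))
    E = np.array([energy(x, W, b, beta) for x in states])
    p = np.exp(-(E - E.min()))  # subtract the minimum for numerical stability
    return p / p.sum()

rng = np.random.default_rng(0)
M = 4
W = rng.standard_normal((M, M))
W = (W + W.T) / 2.0          # symmetric connections
np.fill_diagonal(W, 0.0)     # no self-connections
b = rng.standard_normal(M)
beta = sigm(b)

p_centered = gibbs_distribution(W, b, beta)
# Expanding the centered energy gives the same quadratic term, a bias
# b - W beta, and a constant that cancels in the normalization.
p_shifted = gibbs_distribution(W, b - W @ beta, np.zeros(M))
print(np.max(np.abs(p_centered - p_shifted)))
```

The two distributions agree up to floating-point error, which is why centering changes the geometry of the optimization problem without changing the family of models that can be represented.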



2.1 Stability of the Centered Boltzmann Machine 

Figure 1: Example of sigmoids with different biases and offsets. The three non-dashed
sigmoids are said to be centered because they cross the origin. We show that
centering sigmoids leads to a better conditioned Hessian.

In this section, we look at the stability of the underlying optimization problem. We argue
that when the sigmoid is centered, the Hessian is better conditioned (see Figure 2), and
therefore, the learning algorithm is more stable. We define $\xi$ as the centered state $\xi = x - \beta$.
The derivative of the model log-likelihood with respect to the weight vector takes the form

$$\frac{\partial}{\partial W} \langle \log p(x; \theta) \rangle_{\text{data}} = \langle \xi \xi^\top \rangle_{\text{data}} - \langle \xi \xi^\top \rangle_W$$



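where $\langle \cdot \rangle_W$ designates the expectation with respect to the probability distribution associated
to a model of weight parameter $W$. Using the definition of the directional derivative, the
second derivative with respect to a random direction $V$ (which is equal to the projected
Hessian $HV$) can be expressed as:

$$\begin{aligned}
HV &= \frac{\partial}{\partial V} \left( \frac{\partial}{\partial W} \langle \log p(x; \theta) \rangle_{\text{data}} \right) \\
&= \lim_{h \to 0} \frac{1}{h} \left[ \left( \langle \xi \xi^\top \rangle_{\text{data}, W + hV} - \langle \xi \xi^\top \rangle_{W + hV} \right) - \left( \langle \xi \xi^\top \rangle_{\text{data}, W} - \langle \xi \xi^\top \rangle_{W} \right) \right] \\
&= \lim_{h \to 0} \frac{1}{h} \left( \langle \xi \xi^\top \rangle_{\text{data}, W + hV} - \langle \xi \xi^\top \rangle_{\text{data}, W} \right) - \lim_{h \to 0} \frac{1}{h} \left( \langle \xi \xi^\top \rangle_{W + hV} - \langle \xi \xi^\top \rangle_{W} \right)
\end{aligned}$$
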
From the last line, we can see that the Hessian can be decomposed into a data-dependent
term and a data-independent term. A remarkable fact is that in the absence of hidden units, the
data-dependent part of the Hessian is zero, because the model (and therefore, the perturbation
of the model) has no influence on the states. The conditioning of the optimization
problem can therefore be analyzed exclusively from the perspective of the model without
even looking at the data. The data-dependent term is likely to be small even in the presence
of hidden variables due to the sharp reduction of entropy caused by the clamping of visible
units to data.

Figure 2: Relation between the conditioning number $\lambda_1 / \lambda_n$ and the shape of the optimization
problem. Gradient descent is easier to achieve when the conditioning number
is small.

We can think of a well-conditioned model as a model for which a perturbation of the
model parameter $W$ in any direction $V$ causes a well-behaved perturbation of state expectations
$\langle \xi \xi^\top \rangle_W$. Pearlmutter (1994) showed that in a Boltzmann machine with no hidden
units, the projected Hessian can be further reduced to

$$HV = \langle \xi \xi^\top \rangle_W \cdot \langle D \rangle_W - \langle \xi \xi^\top D \rangle_W \quad \text{where} \quad D = \tfrac{1}{2} \xi^\top V \xi \tag{1}$$

thus, getting rid of the limit and leading to numerically more accurate estimates. LeCun et al.
(1998) showed that the stability of the optimization problem can be quantified by the conditioning
number defined as the ratio between the largest eigenvalue $\lambda_1$ and the smallest
eigenvalue $\lambda_n$ of $H$. A geometrical interpretation of the conditioning number is given in
Figure 2. A low rank approximation of the Hessian can be obtained as

$$\hat{H} = H (V_0 \mid \dots \mid V_n) = (HV_0 \mid \dots \mid HV_n) \tag{2}$$

where the columns of $(V_0 \mid \dots \mid V_n)$ form a basis of independent unit vectors that projects
the Hessian on a low-dimensional random subspace. The conditioning number can then be
estimated by performing a singular value decomposition of the projected Hessian $\hat{H}$ and
taking the ratio between the largest and smallest resulting eigenvalues.

We estimate below the conditioning number $\lambda_1 / \lambda_n$ of a fully connected Boltzmann
machine of 50 units at initial state ($W = 0$) for different bias and offset parameters $b$ and
$\beta$ using Equations 1 and 2.

$\lambda_1 / \lambda_n$                   $b = 2$    $b = 0$    $b = -2$

$\beta = \mathrm{sigm}(2)$          2.26      21.97     839.59
$\beta = \mathrm{sigm}(0)$         83.43       2.75      95.57
$\beta = \mathrm{sigm}(-2)$       866.00      22.95       2.24



These numerical estimates clearly exhibit the better conditioning occurring when the
sigmoid is centered. The more than 100-fold factor between the conditioning number of
non-centered and centered Boltzmann machines is striking.
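
The estimation procedure of Equations 1 and 2 can be sketched as follows. This is a minimal illustration under several assumptions that are not in the text: a smaller machine (8 units), a full random orthonormal basis of directions rather than a low-rank one, $D = \frac{1}{2}\xi^\top V \xi$ for symmetric $V$ without self-connections, and Monte Carlo expectations from 20000 samples. Because of these differences, the printed values should not be expected to match the table above; only the qualitative pattern (small ratio when $\beta = \mathrm{sigm}(b)$) is meant to carry over.

```python
import numpy as np

def sigm(x):
    return 1.0 / (1.0 + np.exp(-x))

def conditioning_number(b, beta, n_units=8, n_samples=20000, seed=0):
    """Monte Carlo estimate of lambda_1 / lambda_n at W = 0."""
    rng = np.random.default_rng(seed)
    # At W = 0 the units are independent, so x_i ~ B(sigm(b)) exactly.
    x = (rng.random((n_samples, n_units)) < sigm(b)).astype(float)
    xi = x - beta
    # Products xi_i * xi_j for every connection i < j (no self-connections).
    i, j = np.triu_indices(n_units, k=1)
    prod = xi[:, i] * xi[:, j]                  # (n_samples, n_pairs)
    n_pairs = prod.shape[1]
    # Random orthonormal directions V_0 ... V_n, one per column.
    # Assumption: a full basis is used here instead of a low-rank one.
    Q, _ = np.linalg.qr(rng.standard_normal((n_pairs, n_pairs)))
    # D = 1/2 xi^T V xi = sum_{i<j} V_ij xi_i xi_j for a symmetric V.
    D = prod @ Q                                # (n_samples, n_pairs)
    # Equation 1, one column per direction: HV = <xi xi^T><D> - <xi xi^T D>.
    HV = np.outer(prod.mean(axis=0), D.mean(axis=0)) - prod.T @ D / n_samples
    s = np.linalg.svd(HV, compute_uv=False)     # sorted in decreasing order
    return s[0] / s[-1]

for b in (2.0, 0.0, -2.0):
    for beta in (sigm(2.0), sigm(0.0), sigm(-2.0)):
        ratio = conditioning_number(b, beta)
        print(f"b={b:+.0f}  beta={beta:.2f}  ratio={ratio:.2f}")
```

At $W = 0$ the projected Hessian reduces to the negative covariance of the products $\xi_i \xi_j$, so a non-zero mean of $\xi$ couples all products that share a unit and inflates the largest eigenvalue, while a centered $\xi$ leaves the products uncorrelated.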
Figure 3: On the left, diagram of a two-layer deep Boltzmann machine along with its parameters.
On the right, different sampling methods: (i) a feed-forward pass on
the network starting from a data point, (ii) the path followed by the alternate
Gibbs sampler and (iii) the path followed by the alternate Gibbs sampler when
the input is clamped to data.



2.2 Centered Deep Boltzmann Machines 

For technical and practical reasons, it is common to introduce a structure to the Boltzmann
machine by restricting the connections between its units. A typical structure is the deep
Boltzmann machine (DBM, Salakhutdinov and Hinton, 2009) in which units are organized
in a deep layered architecture. The layered structure of the DBM has two advantages: first,
it gives a specific role to units at each layer so that we can easily build top layer kernels
that exploit the hierarchical structure of data. Second, the layered structure of the DBM
can be folded into a bipartite graph from which it is easy to derive an efficient alternate
Gibbs sampler. In the case of the two-layer deep Boltzmann machine shown in Figure 3,
the energy function associated to each state $(x, y, z) \in \{0, 1\}^{M_x + M_y + M_z}$ takes the form

$$\begin{aligned}
E(x, y, z; \theta) = {} & -(y - \beta)^\top W (x - \alpha) - (z - \gamma)^\top V (y - \beta) \\
& -(x - \alpha)^\top a - (y - \beta)^\top b - (z - \gamma)^\top c
\end{aligned}$$

where $\theta = \{W, V, a, b, c, \alpha, \beta, \gamma\}$ groups the model parameters. Data-independent states can
be sampled using the following alternate Gibbs sampler:

$$\left\{ x \sim \mathcal{B}(\mathrm{sigm}(W^\top (y - \beta) + a)), \quad z \sim \mathcal{B}(\mathrm{sigm}(V (y - \beta) + c)) \right\} \tag{3}$$

$$y \sim \mathcal{B}(\mathrm{sigm}(W (x - \alpha) + V^\top (z - \gamma) + b)) \tag{4}$$



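The same Gibbs sampler can be used for sampling data-dependent states, with the difference
that the input units $x$ are clamped to the data. We show below a basic algorithm based on
persistent contrastive divergence for training a two-layer centered DBM:

Basic algorithm for training a 2-layer centered DBM:

    $W, V = 0, 0$
    $a, b, c = \mathrm{sigm}^{-1}(\langle x \rangle_{\text{data}}), b_0, c_0$
    $\alpha, \beta, \gamma = \mathrm{sigm}(a), \mathrm{sigm}(b), \mathrm{sigm}(c)$
    initialize free particle $(x_m, y_m, z_m) = (\alpha, \beta, \gamma)$
    loop
        initialize data particle $(x_d, y_d, z_d) = (\mathrm{pick}(\text{data}), \beta, \gamma)$
        loop
            $y_d \sim \mathcal{B}(\mathrm{sigm}(W (x_d - \alpha) + V^\top (z_d - \gamma) + b))$
            $z_d \sim \mathcal{B}(\mathrm{sigm}(V (y_d - \beta) + c))$
        end loop
        $y_m \sim \mathcal{B}(\mathrm{sigm}(W (x_m - \alpha) + V^\top (z_m - \gamma) + b))$
        $x_m \sim \mathcal{B}(\mathrm{sigm}(W^\top (y_m - \beta) + a))$
        $z_m \sim \mathcal{B}(\mathrm{sigm}(V (y_m - \beta) + c))$
        $W = W + \eta \cdot [(y_d - \beta)(x_d - \alpha)^\top - (y_m - \beta)(x_m - \alpha)^\top]$
        $V = V + \eta \cdot [(z_d - \gamma)(y_d - \beta)^\top - (z_m - \gamma)(y_m - \beta)^\top]$
        $a = a + \eta \cdot (x_d - x_m)$
        $b = b + \eta \cdot (y_d - y_m)$
        $c = c + \eta \cdot (z_d - z_m)$
    end loop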
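
The basic algorithm can be transcribed almost line by line into NumPy. The sketch below is one possible reading with a single data particle and a single free particle per step; the function name, the default hyperparameters, the number of inner Gibbs iterations and the clipping of the data mean (to keep the inverse sigmoid finite) are assumptions made for illustration.

```python
import numpy as np

def sigm(x):
    return 1.0 / (1.0 + np.exp(-x))

def bernoulli(p, rng):
    """Draw a binary vector with P(unit = 1) = p."""
    return (rng.random(p.shape) < p).astype(float)

def train_centered_dbm(data, n_hidden=(400, 100), b0=0.0, c0=0.0,
                       eta=0.0005, n_steps=1000, n_gibbs=5, seed=0):
    rng = np.random.default_rng(seed)
    Mx = data.shape[1]
    My, Mz = n_hidden
    W, V = np.zeros((My, Mx)), np.zeros((Mz, My))
    # Assumption: clip the data mean so that the inverse sigmoid stays finite.
    mean = np.clip(data.mean(axis=0), 0.01, 0.99)
    a = np.log(mean / (1.0 - mean))
    b, c = np.full(My, float(b0)), np.full(Mz, float(c0))
    alpha, beta, gamma = sigm(a), sigm(b), sigm(c)   # offsets stay fixed
    xm, ym, zm = alpha.copy(), beta.copy(), gamma.copy()   # free particle
    for _ in range(n_steps):
        # Data particle: the input is clamped to a randomly picked data point.
        xd = data[rng.integers(len(data))]
        yd, zd = beta.copy(), gamma.copy()
        for _ in range(n_gibbs):
            yd = bernoulli(sigm(W @ (xd - alpha) + V.T @ (zd - gamma) + b), rng)
            zd = bernoulli(sigm(V @ (yd - beta) + c), rng)
        # Free particle: one step of the alternate Gibbs sampler.
        ym = bernoulli(sigm(W @ (xm - alpha) + V.T @ (zm - gamma) + b), rng)
        xm = bernoulli(sigm(W.T @ (ym - beta) + a), rng)
        zm = bernoulli(sigm(V @ (ym - beta) + c), rng)
        # Stochastic gradient step on the centered statistics.
        W += eta * (np.outer(yd - beta, xd - alpha) - np.outer(ym - beta, xm - alpha))
        V += eta * (np.outer(zd - gamma, yd - beta) - np.outer(zm - gamma, ym - beta))
        a += eta * (xd - xm)
        b += eta * (yd - ym)
        c += eta * (zd - zm)
    return {"W": W, "V": V, "a": a, "b": b, "c": c,
            "alpha": alpha, "beta": beta, "gamma": gamma}
```

A call such as `train_centered_dbm(binary_images, n_hidden=(400, 100))` would match the layer sizes used later in the experiments, where `binary_images` stands for any array of binary rows; minibatches and parameter averaging are left out of this sketch.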



3. Discriminative Analysis 



We present the method introduced by Montavon et al. (2011) that measures how the representation
evolves layer after layer in a deep network. It is based on the theoretical insight
that the projection of the input distribution onto the hidden units of each layer provides a
function space that can be thought of as a representation or a feature extractor.

The method aims to characterize this function space by constructing a kernel for each
layer that approximates the implicit transfer function between the input and the layer
and measuring how much these kernels "match" the task of interest. The approach is
theoretically motivated by the work of Braun et al. (2008) showing that projections on the
leading components of the implicit kernel feature map (Schölkopf et al., 1998) obtained with
a finite and typically small number of samples $n$ are close with essentially multiplicative
errors to their asymptotic counterparts. In the following lines, we describe the principal
steps of the analysis:

Let $X$ and $T$ be two matrices of $n$ rows representing respectively the inputs and labels
of a data set of $n$ samples. Let

$$f : x \mapsto f_L \circ \dots \circ f_1(x)$$

be a deep network of $L$ layers. We build a hierarchy of increasingly "deep" kernels

$$k_{1,\sigma}(x, x') = \kappa_\sigma(f_1(x), f_1(x'))$$
$$\vdots$$
$$k_{L,\sigma}(x, x') = \kappa_\sigma(f_L \circ \dots \circ f_1(x), f_L \circ \dots \circ f_1(x'))$$

that subsume the mapping performed by more and more layers of the deep network and
where $\kappa_\sigma$ is an RBF kernel of scale $\sigma$. For each kernel $k_{l,\sigma}$ we can compute the empirical
kernel $K_{l,\sigma}$ of size $n \times n$ and its eigenvectors $u^1_{l,\sigma}, \dots, u^n_{l,\sigma}$ sorted by decreasing magnitude
of their respective eigenvalues $\lambda^1_{l,\sigma}, \dots, \lambda^n_{l,\sigma}$.

We measure how good a representation is with respect to a certain task by measuring 
whether the task is contained within the leading principal components of the representation. 
The matrix

$$U^d_{l,\sigma} = (u^1_{l,\sigma} \mid \dots \mid u^d_{l,\sigma})$$

spans the $d$ leading kernel principal components of the empirical kernel. The error is obtained
as the residuals of the projection of the labels $T$ on the $d$ leading components of the mapped
distribution:

$$e_T(l, d, \sigma) = \left\| T - U^d_{l,\sigma} U^{d\top}_{l,\sigma} T \right\|_F$$



Curves $(e_T(l, 0, \sigma), \dots, e_T(l, d, \sigma))$ represent how well the task can be solved as we add more
and more principal components of the data distribution. These curves can be interpreted 
as learning curves as the regularization imposed by the rank of the kernel feature space 
determines the number of samples that are necessary in order to train the model effectively. 
Therefore, the number of observed kernel principal components d closely relates to the 
amount of label information given to the learning machine. Small values for d cover the "one- 
shot" learning regime where the model is asked to generalize from very few observations. 
On the other hand, large values for d cover the other extreme case where label information is 
abundant, and where the representation has to be rich enough in order to encode any subtle 
variation of the learning problem. For practical purposes, these curves can be reduced as 
follows: 



$$e_T(l, d) = \min_\sigma e_T(l, d, \sigma) \tag{5}$$

$$e_T(l) = \frac{1}{n} \sum_{d=1}^{n} e_T(l, d) \tag{6}$$

These compact measures of how well layer $l$ represents $T$ make it easier to compare the
layer-wise evolution of the representation for different architectures.
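
The steps of the analysis can be sketched in a few lines of NumPy. The code below is an illustration under assumptions rather than the exact procedure: the RBF kernel is taken as $\exp(-\|f - f'\|^2 / (2\sigma^2))$, the kernel is not centered, the residual is the plain Frobenius norm, and the function names are made up for this example. `F` stands for the activations of one layer (one row per sample) and `T` for the label matrix.

```python
import numpy as np

def rbf_kernel(F, sigma2):
    """Empirical RBF kernel matrix of size n x n (assumed parameterization)."""
    sq_norms = np.sum(F ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * F @ F.T
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * sigma2))

def residual_curve(F, T, sigma2):
    """Projection residuals e_T(l, d, sigma) for d = 0, ..., n."""
    K = rbf_kernel(F, sigma2)
    eigvals, eigvecs = np.linalg.eigh(K)
    order = np.argsort(-np.abs(eigvals))        # decreasing eigenvalue magnitude
    U = eigvecs[:, order]
    coeffs = U.T @ T                            # coordinates of T in the eigenbasis
    # Since U is orthonormal, ||T - U_d U_d^T T||_F^2 = ||T||_F^2 - ||U_d^T T||_F^2.
    total = np.sum(T ** 2)
    explained = np.concatenate([[0.0], np.cumsum(np.sum(coeffs ** 2, axis=1))])
    return np.sqrt(np.maximum(total - explained, 0.0))

def layer_error(F, T, sigma2_candidates=(1.0, 10.0, 100.0, 1000.0, 10000.0)):
    curves = np.stack([residual_curve(F, T, s2) for s2 in sigma2_candidates])
    e_ld = curves.min(axis=0)                   # Equation 5: best scale for each d
    return e_ld, e_ld[1:].mean()                # Equation 6: average over d = 1..n

# Toy usage with random features and one-hot labels (illustrative data only).
rng = np.random.default_rng(0)
F = rng.standard_normal((30, 5))
T = np.eye(3)[rng.integers(0, 3, size=30)]
e_ld, auc = layer_error(F, T)
print(e_ld[:5], auc)
```

Computing all residuals from the cumulative sum of squared coefficients avoids one projection per value of $d$, which matters when the curve is evaluated for every $d$ up to $n$.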

4. Generative Analysis 

Here, we present an analysis that estimates the likelihood of the learned Boltzmann machine
(Salakhutdinov and Hinton, 2010) based on annealed importance sampling (AIS, Neal,
2001). We describe here the basic analysis. Salakhutdinov and Hinton (2010) introduced
more elaborate procedures for particular types of Boltzmann machines such as restricted,
semi-restricted and deep Boltzmann machines.

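A deep Boltzmann machine associates to each input $x$ a probability

$$p(x; \theta) = \frac{\Psi(\theta, x)}{Z(\theta)} \quad \text{where} \quad \Psi(\theta, x) = \sum_{y,z} p^\star(x, y, z; \theta) \quad \text{and} \quad Z(\theta) = \sum_{x,y,z} p^\star(x, y, z; \theta)$$

and where $p^\star(x, y, z; \theta) = e^{-E(x, y, z; \theta)}$ is the unnormalized probability of state $(x, y, z)$. Computing
$\Psi(\theta, x)$ and $Z(\theta)$ analytically is intractable because of the exponential number of
elements involved in the sum. Let us rewrite the ratio of partition functions as follows:

$$p(x; \theta) = \frac{\Psi(\theta, x)}{\Psi(0, x)} \cdot \frac{\Psi(0, x)}{Z(0)} \cdot \frac{Z(0)}{Z(\theta)} \tag{7}$$

It can be first noticed that the ratio of base-rate partition functions ($\theta = 0$) is easy to
compute as $\theta = 0$ makes units independent. It has the analytical form

$$\frac{\Psi(0, x)}{Z(0)} = \frac{1}{2^{M_x}} \tag{8}$$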
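
The value of the base-rate ratio follows from counting states. The step below spells this out, assuming that $\theta = 0$ sets all weights and biases to zero so that every state has unnormalized probability one, and writing $\Psi$ for the partition function with the input clamped to $x$:

```latex
\begin{align*}
\Psi(0, x) &= \sum_{y, z} e^{0} = 2^{M_y + M_z}, \\
Z(0) &= \sum_{x, y, z} e^{0} = 2^{M_x + M_y + M_z}, \\
\frac{\Psi(0, x)}{Z(0)} &= \frac{2^{M_y + M_z}}{2^{M_x + M_y + M_z}} = 2^{-M_x}.
\end{align*}
```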

The two other ratios in Equation 7 can be estimated using annealed importance sampling.
The annealed importance sampling method proceeds as follows:

Annealed importance sampling:

1. Generate a sequence of states $\xi_1, \dots, \xi_K$ using a sequence of transition operators
   $T(\xi, \xi'; \theta_0), \dots, T(\xi, \xi'; \theta_K)$ that leave $p(\xi)$ invariant, that is,

   • Draw $\xi_0$ from the base model (e.g. a random vector of zeros and ones)

   • Draw $\xi_1$ given $\xi_0$ using $T(\xi, \xi'; \theta_1)$

   • ...

   • Draw $\xi_K$ given $\xi_{K-1}$ using $T(\xi, \xi'; \theta_K)$

2. Compute the importance weight

   $$\omega_{\text{AIS}} = \frac{p^\star(\xi_1; \theta_1)}{p^\star(\xi_1; \theta_0)} \cdot \frac{p^\star(\xi_2; \theta_2)}{p^\star(\xi_2; \theta_1)} \cdots \frac{p^\star(\xi_K; \theta_K)}{p^\star(\xi_K; \theta_{K-1})}$$



It can be shown that if the sequence of models $\theta_0, \theta_1, \dots, \theta_K$ where $\theta_0 = 0$ and $\theta_K = \theta$
evolves slowly enough, the importance weight obtained with the annealed importance
sampling procedure is an estimate for the ratio between the partition function of the model
$\theta$ and the partition function of the base rate model.
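
A small self-contained sketch can make the procedure concrete. It is not the DBM estimator itself: it assumes a tiny non-centered Boltzmann machine with energy $-\frac{1}{2}x^\top W x - x^\top b$ (6 units, so the exact ratio can be enumerated), a linear schedule $\theta_k = (k/K)\,\theta$, and the common AIS indexing in which each ratio is evaluated at the state reached before the next transition. All names are illustrative.

```python
import itertools
import numpy as np

def sigm(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_energy(X, W, b):
    """-E(x) = 1/2 x^T W x + x^T b for each row of X."""
    return 0.5 * np.einsum("ri,ij,rj->r", X, W, X) + X @ b

def ais_log_ratio(W, b, n_steps=200, n_runs=500, seed=0):
    """AIS estimate of log(Z(theta) / Z(0)) with theta_k = (k / K) * theta."""
    rng = np.random.default_rng(seed)
    M = len(b)
    X = (rng.random((n_runs, M)) < 0.5).astype(float)   # draws from the base model
    log_w = np.zeros(n_runs)
    ts = np.linspace(0.0, 1.0, n_steps + 1)
    for t_prev, t in zip(ts[:-1], ts[1:]):
        # log of p*(x; theta_k) / p*(x; theta_{k-1}) at the current state
        log_w += (t - t_prev) * neg_energy(X, W, b)
        # one Gibbs sweep that leaves p(x; theta_k) invariant
        for i in range(M):
            field = t * (X @ W[:, i] + b[i])
            X[:, i] = (rng.random(n_runs) < sigm(field)).astype(float)
    # log of the average importance weight, computed in a stable way
    return np.logaddexp.reduce(log_w) - np.log(n_runs)

def exact_log_ratio(W, b):
    """Exact log(Z(theta) / Z(0)) by enumeration, feasible only for small M."""
    M = len(b)
    states = np.array(list(itertools.product([0.0, 1.0], repeat=M)))
    return np.logaddexp.reduce(neg_energy(states, W, b)) - M * np.log(2.0)

rng = np.random.default_rng(1)
M = 6
W = 0.5 * rng.standard_normal((M, M))
W = (W + W.T) / 2.0
np.fill_diagonal(W, 0.0)
b = 0.5 * rng.standard_normal(M)
print(ais_log_ratio(W, b), exact_log_ratio(W, b))
```

Working with log weights and averaging them with `np.logaddexp.reduce` avoids overflow when the annealing is long; clamping some units during the Gibbs sweep would give the corresponding estimate with the input fixed to a data point.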

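In our case, $\xi$ denotes the state $(x, y, z)$ of the DBM and the transition operator $T(\xi, \xi'; \theta)$
is the alternate Gibbs sampler defined in Equation 3. We can now compute the two ratios
of partition functions of Equation 7 as

$$\frac{Z(\theta)}{Z(0)} \approx \mathrm{E}[\omega_{\text{AIS}}] \quad \text{and} \quad \frac{\Psi(\theta, x)}{\Psi(0, x)} \approx \mathrm{E}[\nu_{\text{AIS}}(x)] \tag{9}$$

where $\omega_{\text{AIS}}$ is the importance weight resulting from the annealing process with the freely
running Gibbs sampler and $\nu_{\text{AIS}}$ is the importance weight resulting from the annealing with
input units clamped to the data point. Substituting Equations 8 and 9 into Equation 7, we
obtain

$$p(x; \theta) \approx \frac{\mathrm{E}[\nu_{\text{AIS}}(x)]}{\mathrm{E}[\omega_{\text{AIS}}]} \cdot \frac{1}{2^{M_x}}$$

and therefore, the log-likelihood of the model is

$$\mathrm{E}_x[\log p(x; \theta)] \approx \mathrm{E}_x[\log \mathrm{E}[\nu_{\text{AIS}}(x)]] - \log \mathrm{E}[\omega_{\text{AIS}}] - M_x \log(2). \tag{10}$$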

Generally, computing an average of the importance weight $\nu_{\text{AIS}}$ for each data point $x$ can
take a long time. In practice, we can use an approximation to this computation where the
estimate is computed with a single AIS run for each point. In that case, it follows from
Jensen's inequality that

$$\mathrm{E}_x[\log \nu_{\text{AIS}}(x)] - \log \mathrm{E}[\omega_{\text{AIS}}] \le \mathrm{E}_x[\log \mathrm{E}[\nu_{\text{AIS}}(x)]] - \log \mathrm{E}[\omega_{\text{AIS}}]. \tag{11}$$

Consequently, this approximation tends to produce slightly pessimistic estimates of the
model log-likelihood. However, the variance of $\nu_{\text{AIS}}$ is low compared to the variance of $\omega_{\text{AIS}}$
because the clamping of visible units to data points sharply reduces the diversity of AIS
runs. We find that this approximation is sufficiently accurate for the purpose of this paper,
that is, demonstrating the importance of centering deep Boltzmann machines.

5. Experimental Setup 

In this section, we describe the different parameters used to train the deep Boltzmann 
machines and to perform the discriminative and generative analysis. These parameters 
correspond to reasonable choices, most of which have been validated by previous research 
work. 

Architecture  We consider two-layer deep Boltzmann machines made of 784 input units,
400 intermediate units and 100 top units. The initial biases and offsets for visible units
are set to $a_0 = \mathrm{sigm}^{-1}(\langle x \rangle_{\text{data}})$ and $\alpha = \mathrm{sigm}(a_0)$. We consider different initial biases
($b_0, c_0 = -2$, $b_0, c_0 = 0$ and $b_0, c_0 = 2$) and offsets ($\beta, \gamma = \mathrm{sigm}(-2)$, $\beta, \gamma = \mathrm{sigm}(0)$ and
$\beta, \gamma = \mathrm{sigm}(2)$) for the hidden units. These offsets and initial biases correspond to the
sigmoids plotted in Figure 1.



Data We train the DBMs on a binary version of the MNIST handwritten digits data set 
where the activation threshold is set to 0.5 (medium gray). The MNIST training set consists 
of 60,000 samples. Each sample is a binary image of size 28 x 28 representing a handwritten 
digit and is fed to the DBM as a 784-dimensional binary vector. 



Inference  We use persistent contrastive divergence (Tieleman, 2008) to train the network
and keep track of 25 free particles in the background of the learning procedure. We use a Gibbs
sampling estimation to collect both the data-independent and data-dependent statistics.
The rationale for this is that the more classical mean field estimation of data statistics
(Salakhutdinov and Hinton, 2009) tends to artificially drive the DBM to sparsity due to
the convex/concave shape of the sigmoid function. At each step of the learning procedure,
we run 5 iterations of the alternate Gibbs sampler for collecting the data-dependent statistics
and one iteration for updating the data-independent statistics.

Learning  We use a stochastic gradient descent on the approximate log-likelihood with
minibatches of size 25 and a learning rate $\eta = 0.0005$ for each layer. For practical purposes,
the minibatch size is set equivalent to the number of particles for persistent contrastive
divergence (Hinton, 2010). We consider models trained for $10^0$, $10^{0.5}$, $10^1$, $10^{1.5}$ and $10^2$
epochs.



Model averaging  We use a variant of averaged stochastic gradient descent (Polyak and Juditsky,
1992; Tieleman and Hinton, 2009; Xu, 2011) for reducing the parameter noise. We compute
at each step $k$ the new parameter estimate $\theta_{\text{avg}} \leftarrow \frac{\kappa}{k} \cdot \theta + \frac{k - \kappa}{k} \cdot \theta_{\text{avg}}$ with $\kappa = 10$ in
order to only remember the last 10% of the training procedure.

Discriminative analysis  The analysis is performed on a subset of 500 samples drawn
randomly from the MNIST test set. Representations at each layer are built by running
a Gibbs sampler for 100 iterations with the input clamped to data and taking the mean
activation of each unit. Discriminative performance is measured as the projection residuals
of the labels (see Equation 5) and the area under the error curve (see Equation 6). Results
are produced with candidate scale parameters of the Gaussian kernel $\sigma^2 = 1, 10, 100, 1000$
and $10000$.

Generative analysis  The generative analysis is performed on a subset of 500 samples
drawn randomly from the MNIST test set. Generative performance is measured as the
estimated log-likelihood of the model given the test data (see Equation 10). We estimate
the partition function $Z(\theta)/Z(0)$ using 500 AIS runs. We estimate each of the 500 partition
functions $\Psi(\theta, x)/\Psi(0, x)$ using a single AIS run. Each AIS run has length $K = 2500$ where the
model parameter at the $k$-th step of the annealing process is defined as $\theta_k = \left(1 - (1 - k/K)^4\right) \cdot \theta$.
This sequence of parameters implies that annealing starts with large parameter updates
and finishes with very small updates.

6. Results 

Table 1 corroborates the importance of centering for better discrimination in the top layer of
a deep Boltzmann machine. As can be seen in Figure 5 (left), discriminative performance
of the top layer can be further improved by training the network for a longer time.




AUC error                         $b_0, c_0 = 2$   $b_0, c_0 = 0$   $b_0, c_0 = -2$

$\beta, \gamma = \mathrm{sigm}(2)$           0.119         0.194         0.285
$\beta, \gamma = \mathrm{sigm}(0)$           0.133         0.090         0.127
$\beta, \gamma = \mathrm{sigm}(-2)$          0.368         0.323         0.114

Table 1: Discriminative performance after 10 epochs in the top layer of the deep Boltzmann
machine as measured by Equation 6 for different configurations of initial bias and
offset. The lower the AUC error the better. In each case, centering sigmoids leads
to better discrimination in the top layer.



$\langle \log p(x; \theta) \rangle_{\text{data}}$              $b_0, c_0 = 2$   $b_0, c_0 = 0$   $b_0, c_0 = -2$

$\beta, \gamma = \mathrm{sigm}(2)$           -81.5¹        -86.5¹        -88.9¹
$\beta, \gamma = \mathrm{sigm}(0)$           -83.5¹        -81.2¹        -85.6¹
$\beta, \gamma = \mathrm{sigm}(-2)$          -88.1¹        -83.3¹        -80.4¹

Table 2: Generative performance after 10 epochs in terms of estimated model log-likelihood
$\langle \log p(x; \theta) \rangle_{\text{data}}$ for different configurations of initial bias and offset. The generative
performance is less sensitive to the initial conditioning of the DBM than the top
layer discriminative performance as the top-level units can simply be discarded,
leading essentially to a more robust one-layer generative model.



Table 2 further supports the importance of centering, showing that centered DBMs
learn a better generative model of data. However, the advantage is not as strong as for the
discriminative case. Indeed, units in the top layer are not critical for generative performance
as the learning algorithm can simply discard them and learn a one-layer shallow generative
model instead.

Figures 4 and 5 highlight the importance of centering for faster and more stable learning.
The models emerging from the centered deep Boltzmann machine have systematically better
discriminative properties in the top layer and good generative properties. While a non-centered
DBM may ultimately learn a model which is as good as the one produced by a
centered DBM, it may also diverge.

Figures 6 and 7 show that each model is able to learn reasonable first-layer filters but
that second-layer filters learned by a centered DBM tend to be more varied than those
learned by a non-centered DBM. This higher variety of second layer filters suggests that the
centered DBM produces a richer top-level representation. The argument is corroborated by
Figure 9 showing that, in the absence of a centering mechanism, the projection of the data on the
top layer representation tends to form a simplistic low-dimensional manifold that may still
contain useful features (for example, discriminating the digit "1" from other digits) but,
on the other hand, that also discards a lot of potentially useful discriminative features. As
suggested by Figure 8, the top-layer simplistic representation may even negatively affect
the generative properties of the model by perturbing the balance between different classes.

¹ In some other research work, authors compute a lower bound of the log probability instead of a
direct estimate of it, thus making a direct comparison impossible. Also, estimates of log probability become
increasingly inaccurate as the model becomes more complex.

Figure 4: On the left, residuals of the projection of the labels in the leading components of
the top layer kernel $k_2(x, x')$ (see Equation 5) after 10 epochs. On the right, layer-wise
evolution of the representation in terms of area under the error curve (see
Equation 6) after 100 epochs. Centered DBMs are more stable than non-centered
ones. Top layer representations are clearly better than the input.

Figure 5: Convergence speed of centered and non-centered DBMs (in terms of top layer
AUC error and model log-likelihood). Centered DBMs learn faster and are more
stable than non-centered ones. Note that the estimate of the log-likelihood from
Equation 10 becomes inaccurate as the model becomes more complex (after 10
epochs).

Figure 6: Examples of first-layer filters of the DBM for different bias and offset parameters
after 100 epochs. These filters are rendered using a linear backprojection of top
layer units onto the input space. Each model is producing reasonable first-layer
filters, suggesting that one-layer networks (i.e. restricted Boltzmann machines)
are less sensitive to the quality of the conditioning of the parameter space.

Figure 7: Examples of second-layer filters of the DBM for different bias and offset parameters
after 100 epochs. These filters are rendered using a linear backprojection of
intermediate layer units onto the input space. Here, we can clearly see that the
diversity of filters is higher when the DBM is centered.

Figure 8: Examples of digits generated by the DBM for different bias and offset parameters
after 10 epochs. The degenerated second layer of the non-centered DBM seems
to have a negative impact on the balance between different classes.

Figure 9: 2-kPCA visualization of the top-level representation in the DBM for different
bias and offset parameters at different stages of training. Points are colored according
to their label ("0"=red, "1"=blue, "2"=green, "3"=yellow, "4"=orange,
"5"=black, "6"=brown, "7"=gray, "8"=magenta, "9"=cyan). Non-centered
DBMs tend to collapse the data onto a simplistic low-dimensional manifold in
the top layer representation. On the other hand, in the centered DBM,
we can clearly observe in the late stage of training the emergence of clusters
corresponding to labels.

7. Conclusion 

We presented a simple modification of the deep Boltzmann machine that centers the output 
of the sigmoids by rewriting the energy function as a function of centered states. This 
centered version of the deep Boltzmann machine is easy to implement as it simply involves 
a reparameterization of the energy function. A theoretical motivation for centering is that 
it leads to a better conditioning of the Hessian of the optimization criterion. 

This simple modification makes it possible to efficiently learn a deep Boltzmann machine without
greedy layer-wise pretraining. Experiments on real data corroborate the benefits of centering,
showing that the centered deep Boltzmann machine learns faster and is more stable than
its non-centered counterpart. In addition, the centered deep Boltzmann machine produces 
useful discriminative features in the top layer and a good generative model of data. 

Training hierarchies of many layers is still tedious and requires many iterations. Un- 
derstanding whether the difficulty comes from a difficult optimization problem or from the 
exhaustion of statistical information in the data set remains to be done. Also, despite an 
initial good conditioning of the Hessian, it cannot be excluded that the solution progressively
drifts towards degenerate regions of the parameter space throughout the learning
procedure. Strategies to dynamically maintain the solution within well-behaved regions of 
the parameter space or to better descend the objective function also need to be further 
investigated. 